STAT331 Final Project Report

Author

Thien An Tran, Tejasree Kandibanda, Matthew Huang, Chloe Anbarcioglu

Introduction

For our project, we want to explore the relationship between murder rate and happiness score in each country per year. We hypothesize that there will be a negative correlation between the two meaning that the higher the murder rates, the lower happiness scores. This could potentially indicate that safety and security are significant factors in overall happiness.

We obtained our data from Gapminder (n.d.).

population: contains total population counts for each country per year.
murder: contains total number of estimated deaths from interpersonal violence for each country per year.
happiness: contains happiness score (converted to 0 to 100 scale to be in terms of percentage) for each country per year.

1. Data Cleaning

In order to find the relationship between murder rates and happiness score for each country per year, we need to merge the total murders data set and population data set. That way, we can get the murder rate per 100K people.

Note

We first have to convert all the values into numbers (i.e. fix cases such as 1.1k to be 1100).

Code
convert_value <- function(val) {
  val <- as.character(val)
  
  multiplier <- case_when(
    str_detect(val, "k") ~ 1e3,
    str_detect(val, "M") ~ 1e6,
    str_detect(val, "B") ~ 1e9,
    TRUE ~ 1
  )
  
  numeric_value <- as.numeric(str_remove_all(val, "[kMB]"))
  
  return(numeric_value * multiplier)
}

murder_clean <- murder |>
  select(country, `2005`:`2019`) |> 
  pivot_longer(cols = `2005`:`2019`,
               names_to = "year",
               values_to = "murder_count") |> 
  mutate(across(murder_count, ~convert_value(.)))

murder_clean

population_clean <- population |>
  select(country, `2005`:`2019`) |>
  pivot_longer(cols = `2005`:`2019`,
               names_to = "year",
               values_to = "population") |>
  mutate(across(population, ~convert_value(.)))

population_clean

After cleaning the total murders and population data set, we can proceed to merging them to get a data set of the murder rate per 100k people for each country and year. We can then use pivot longer to transform the happiness score data set and merge it with the murder rate per 100k data set to get our final data set.

Code
murder_pop_merged <- murder_clean |>
  inner_join(population_clean, by = c("country", "year"))

murder_rate_clean <- murder_pop_merged |>
  mutate(murder_rate_per_100k = (murder_count / population) * 100000)

happiness_clean <- happiness |>
  select(country, `2005`:`2019`) |>
  pivot_longer(cols = `2005`:`2019`,
               names_to = "year",
               values_to = "happiness_score") |>
   drop_na(happiness_score)

happiness_merged <- murder_rate_clean |>
  inner_join(happiness_clean, by = c("country", "year"))

happiness_merged |>
  head() |>
  knitr::kable(digits = 4) 
country year murder_count population murder_rate_per_100k happiness_score
Afghanistan 2008 3780 26400000 14.3182 37.2
Afghanistan 2009 3870 27400000 14.1241 44.0
Afghanistan 2010 4130 28200000 14.6454 47.6
Afghanistan 2011 4170 29200000 14.2808 38.3
Afghanistan 2012 4240 30500000 13.9016 37.8
Afghanistan 2013 4380 31500000 13.9048 35.7

Final Dataset

The final data set contains 1,820 rows and 6 columns.

The columns are country, year, murder_count, population, murder_rate_per_100k, and happiness_score.

It provides a comprehensive overview of the murder rates and happiness scores across various countries and years. Each entry in the data set corresponds to a unique combination of a country and a year (ranging from 2005 to 2019).

2. Linear Regression

In this part, we will use linear regression to model the relationship between our two quantitative variables, murder rate and happiness score. We want to see if our hypothesis from above holds true and whether there is a negative correlation between the two variables.

Data Visualizations

Here, we will be creating two data visualizations that explore the relationship between our two quantitative variables.

Relationship between Murder Rate and Happiness Score Over Time

Code
murder_happiness_summary <- murder_happiness |>
  group_by(country, year) |>
  summarise(avg_murder_rate = mean(murder_rate_per_100k),
            avg_happiness_score = mean(happiness_score)) |>
  ungroup()

animated_plot <- ggplot(murder_happiness_summary,
                        aes(x = avg_murder_rate,
                            y = avg_happiness_score)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Relationship Between Murder Rate and Happiness Score (2005-2019)",
       subtitle = "Average Happiness Score",
       x = "Average Murder Rate (per 100k)",
       y = "",
       caption = "Year: {round(frame_time)}") +
  transition_time(year) +
  enter_fade() +
  exit_fade() +
  theme_bw() +
  theme(plot.caption = element_text(size = 10))

animate(animated_plot, renderer = gifski_renderer())

In this visualization, we have plotted the murder rate per 100k people versus the happiness score from 2005 to 2019 for each country. Contrary to our initial prediction of seeing a negative correlation, as in having lower murder rates for a higher happiness score, we see that there’s not much of a correlation between the two. The average murder rate is mostly between 0%-25% for each year and the happiness score has a wide range from 20-80 with no apparent correlation.

Relationship Between Murder Rate and Happiness Score

Code
country_murder_happiness <- murder_happiness |>
  group_by(country) |>
  summarise(avg_murder_rate = mean(murder_rate_per_100k),
            avg_happiness_score = mean(happiness_score))

country_murder_happiness |>
  ggplot(aes(x = avg_murder_rate, 
             y = avg_happiness_score)
         ) +
    geom_point(color = "steelblue") +
    geom_smooth(method = "lm", color = "black") +
    labs(title = "Relationship Between Murder Rate and Happiness Score",
         subtitle = "Average Happiness Score",
         x = "Average Murder Rate (per 100k)", 
         y = "") +
    theme_bw()

This plot shows the relationship between the average murder rate per 100k for each country and their overall happiness score. Similar to our previous plot, there is not much correlation between the murder rate and happiness such as an increasing or decreasing happiness score based on the rate of murder.

Linear Regression

We will be using linear regression as a statistical method to model the relationship between murder rate and happiness score. Then we will use this to evaluate the model fit.

x (explanatory): average murder rate per 100K people
y (response): average happiness score

Code
linear_model <- lm(avg_happiness_score~avg_murder_rate, country_murder_happiness)
broom::tidy(linear_model) |>
  knitr::kable(digits = 4)

Predicted Happiness Score = 54.37738 - 0.05908 x Average Murder Rate

According to the linear regression model, when the murder rate is zero, the predicted happiness score is estimated to be 54.38. Additionally, the analysis suggests that for each additional murder per 100,000 people, the predicted happiness score decreases by approximately 0.0591 points.

Model Fit

Code
var_response <- var(country_murder_happiness$avg_happiness_score)
var_fitted <- var(linear_model$fitted.values)
var_resid <- var(linear_model$residuals)
explained_variation <- var_fitted/var_response

table_data <- data.frame(
  Variable = c("Response Variable Variance", "Fitted Values Variance", "Residuals Variance", "Explained Variation"),
  Value = c(var_response, var_fitted, var_resid, explained_variation)
)

table_data |>
  knitr::kable(digits = 4) 
Variable Value
Response Variable Variance 114.7320
Fitted Values Variance 0.3622
Residuals Variance 114.3697
Explained Variation 0.0032

The explained variation is 0.00315, which is almost 0. This means that murder rate explains almost none of the variability in happiness. Almost all of the variability in happiness is unaccounted for.

3. Simulation

Code
predictions <- predict(linear_model, country_murder_happiness)
residual_se <- sigma(linear_model)
simulated_y <- predictions + rnorm(n = length(predictions), mean = 0, sd = residual_se)
Code
# Plot Observed Data
p1 <- ggplot(country_murder_happiness, 
             aes(x = avg_murder_rate, 
                 y = avg_happiness_score)
             ) +
  geom_point(color = "steelblue") +
  labs(title = "Observed Data",
       subtitle = "Average Happiness Score",
       x = "Average Murder Rate", 
       y = "") +
  theme_bw()

# Plot Simulated Data
p2 <- ggplot(country_murder_happiness, 
             aes(x = avg_murder_rate, 
                 y = simulated_y)
             ) +
  geom_point(color = "orange3") +
  labs(title = "Predicted Data",
       subtitle = "Predicted Happiness Score",
       x = "Average Murder Rate", 
       y = "") +
  theme_bw()

# Combine plots with patchwork
p1 + p2